ZAC.PB: An Annotated Corpus for Zero Anaphora Resolution in Portuguese
Abstract
This paper describes the methodology adopted in the construction of the ZAC corpus, an annotated corpus for the study of zero anaphora in Portuguese. To our knowledge, no such corpus currently exists for the Portuguese language. The purpose of this linguistic resource is to promote the use of automatic discovery of linguistic parameters for anaphora resolution systems. Because of the complexity of the linguistic phenomena involved, a detailed description of the different situations is provided. This paper focuses only on the annotation of subject zero anaphors. The main issues regarding zero anaphora in Portuguese are: indefinite subjects, either without verbal agreement marks or with first person plural or third person plural verbal agreement; the position of the anaphor relative to its antecedent, i.e. anaphoric and cataphoric relations; coreference chains, both inside the same sentence and spanning several sentences; and the determination of the head of the antecedent noun phrase for a given anaphor. Finally, preliminary observations taken from the ZAC corpus are presented.
Similar resources
ZAC: Zero Anaphora Corpus A Corpus for Zero Anaphora Resolution in Portuguese
This paper describes a corpus of Brazilian Portuguese texts built in view of the construction of an Anaphora Resolution system, which is part of a fully-fledged Natural Language Processing system (STRING). The ZAC corpus is aimed at the resolution of the so-called zero-anaphora, that is, an anaphora relation where the anaphoric expression (or anaphor) has been zeroed. The paper briefly discusses...
A Discriminative Approach to Japanese Zero Anaphora Resolution with Large-scale Lexicalized Case Frames
We present a discriminative model for Japanese zero anaphora resolution that simultaneously determines an appropriate case frame for a given predicate and its predicate-argument structure. Our model is based on a log linear framework, and exploits lexical features obtained from a large raw corpus, as well as non-lexical features obtained from a relatively small annotated corpus. We report the r...
Zero Pronominal Anaphora Resolution for the Romanian Language
This paper presents a new study on the distribution, identification, and resolution of zero pronouns in Romanian. A Romanian corpus, including legal, encyclopaedic, literary, and news texts has been created and manually annotated for zero pronouns. Using a morphological parser for Romanian and machine learning methods, experiments were performed on the created corpus for the identification and ...
Supporting Anaphor Resolution In Dialogues With A Corpus-Based Probabilistic Model
This paper describes a corpus-based investigation of anaphora in dialogues, using data from English and Portuguese face-to-face conversations. The approach relies on the manual annotation of a significant number of anaphora cases (around three thousand for each language) in order to create a database of real-life usage which ultimately aims at supporting anaphora interpreters in NLP systems. Each ...
IKAR: An Improved Kit for Anaphora Resolution for Polish
This paper presents the Improved Kit for Anaphora Resolution (IKAR), a hybrid system for anaphora resolution for Polish that combines machine learning methods with hand-written rules. We give an overview of the anaphora types annotated in the corpus and the inner workings of the system. The preliminary experiments evaluating IKAR resolution performance are discussed. We have achieved promising results usi...